1,488 research outputs found
CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks
Data quality affects machine learning (ML) model performance, and data
scientists spend a considerable amount of time on data cleaning before model
training. However, to date, there has been no rigorous study of how
exactly cleaning affects ML: the ML community usually focuses on developing ML
algorithms that are robust to particular noise types of certain
distributions, while the database (DB) community has mostly studied the
problem of data cleaning alone without considering how data is consumed by
downstream ML analytics. We propose a CleanML study that systematically
investigates the impact of data cleaning on ML classification tasks. The
open-source and extensible CleanML study currently includes 14 real-world
datasets with real errors, five common error types, seven different ML models,
and multiple cleaning algorithms for each error type (including both commonly
used algorithms in practice as well as state-of-the-art solutions in academic
literature). We control the randomness in ML experiments using statistical
hypothesis testing, and we also control false discovery rate in our experiments
using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a
systematic way to derive many interesting and nontrivial observations. We also
put forward multiple research directions for researchers.
Comment: published in ICDE 202
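The Benjamini-Yekutieli step-up procedure named in the abstract above can be sketched as follows. This is a minimal illustration of the standard procedure, not the CleanML code: the function name `benjamini_yekutieli`, the default `alpha`, and the example p-values are assumptions made for this sketch.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Standard BY step-up procedure: controls the false discovery rate at
    level alpha under arbitrary dependence between the tests.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction c(m) = 1 + 1/2 + ... + 1/m.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values in ascending order of p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= k * alpha / (m * c(m)).
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            max_k = rank
    # Reject the hypotheses with the max_k smallest p-values.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from five cleaning-vs-no-cleaning comparisons.
example = benjamini_yekutieli([0.6, 0.001, 0.041, 0.008, 0.039])
```

With these made-up inputs only the two smallest p-values survive the correction, even though four of the five are below 0.05 individually.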
Lipschitz equivalence of self-similar sets and hyperbolic boundaries
In [9] Kaimanovich introduced the concept of an augmented tree on the symbolic
space of a self-similar set. It is hyperbolic in the sense of Gromov, and it
was shown in [13] that under the open set condition, a self-similar set can be
identified with the hyperbolic boundary of the tree. In this paper, we
investigate in detail a class of simple augmented trees and the Lipschitz
equivalence of such trees. The main purpose is to use this to study the
Lipschitz equivalence problem for totally disconnected self-similar sets,
which has undergone extensive development recently.
Comment: Advances in Mathematics, accepted (2012). 29 pages, 10 figures
SAINE: Scientific Annotation and Inference Engine of Scientific Research
We present SAINE, a Scientific Annotation and Inference ENgine based on a
set of standard open-source software, such as Label Studio and MLflow. We show
that our annotation engine can benefit the further development of a more
accurate classification. Based on our previous work on hierarchical discipline
classifications, we demonstrate its application using SAINE in understanding
the space for scholarly publications. The user study of our annotation results
shows that user input collected with the help of our system can help us better
understand the classification process. We believe that our work will help
foster greater transparency in, and a better understanding of, scientific
research. Our annotation and inference engine can further support downstream meta-science
projects. We welcome collaboration and feedback from the scientific community
on these projects. The demonstration video can be accessed from
https://youtu.be/yToO-G9YQK4. A live demo website is available at
https://app.heartex.com/user/signup/?token=e2435a2f97449fa1 upon free
registration.
Comment: Under review in IJCNLP-AACL Demo 202
Hierarchical Classification of Research Fields in the "Web of Science" Using Deep Learning
This paper presents a hierarchical classification system that automatically
categorizes a scholarly publication using its abstract into a three-tier
hierarchical label set (discipline, field, subfield) in a multi-class setting.
This system enables a holistic categorization of research activities in the
mentioned hierarchy in terms of knowledge production through articles and
impact through citations, permitting those activities to fall into multiple
categories. The classification system distinguishes 44 disciplines, 718 fields
and 1,485 subfields among 160 million abstract snippets in Microsoft Academic
Graph (version 2018-05-17). We used batch training in a modularized and
distributed fashion to address and allow for interdisciplinary and interfield
classifications in single-label and multi-label settings. In total, we have
conducted 3,140 experiments in all considered models (Convolutional Neural
Networks, Recurrent Neural Networks, Transformers). The classification accuracy
is > 90% in 77.13% and 78.19% of the single-label and multi-label
classifications, respectively. We examine the advantages of our classification
by its ability to better align research texts and output with disciplines, to
adequately classify them in an automated way, and to capture the degree of
interdisciplinarity. The proposed system (a set of pre-trained models) can
serve as a backbone to an interactive system for indexing scientific
publications in the future.
Comment: Under review in QS
Modulator-Dependent RBPs Changes Alternative Splicing Outcomes in Kidney Cancer
Alternative splicing alterations can contribute to human disease. The ability of an RNA-binding protein (RBP) to regulate alternative splicing outcomes can be modulated by a variety of genetic and epigenetic mechanisms. In this study, we use a computational framework to investigate the roles of certain genes, termed modulators, in changing the effect of RBPs on splicing regulation. A total of 1,040,254 modulator-mediated RBP-splicing interactions were identified, involving 137 RBPs, 4,309 splicing events and 2,905 modulator candidates from TCGA-KIRC RNA sequencing data. Modulator function categories were defined according to the changes in correlation between RBP expression and the splicing outcomes of their targets. QKI, one of the RBPs influencing the most splicing events, attracted our attention in this study: 2,014 changing triplets were identified, involving 1,101 modulators and 187 splicing events. Pathway enrichment analysis showed that QKI splicing targets were enriched in the tight junction, endocytosis and MAPK signaling pathways, all of which are highly associated with cancer development and progression. This is the first comprehensive study of how changes in alternative splicing outcomes are associated with different expression levels of certain proteins, even when the splicing events are regulated by the same RBP. Our work may provide a novel view for understanding alternative splicing mechanisms in kidney cancer.
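One way to picture a modulator-dependent correlation change is to compare the RBP-splicing correlation between samples with low and high modulator expression. The sketch below is illustrative only. The abstract does not describe the framework's actual statistics, so the median split, the Pearson correlation, the function names, and the toy data are all assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_change(rbp, splicing, modulator):
    """Return (r_low, r_high, r_high - r_low) for one hypothetical triplet.

    Samples are ranked by modulator expression and split into a lower and an
    upper half (the middle sample is dropped when the count is odd). The
    RBP-splicing correlation is then computed within each half.
    """
    if not (len(rbp) == len(splicing) == len(modulator)):
        raise ValueError("all inputs must have the same length")
    if len(rbp) < 4:
        raise ValueError("need at least four samples")
    ranked = sorted(range(len(modulator)), key=lambda i: modulator[i])
    half = len(ranked) // 2
    low, high = ranked[:half], ranked[-half:]
    r_low = pearson([rbp[i] for i in low], [splicing[i] for i in low])
    r_high = pearson([rbp[i] for i in high], [splicing[i] for i in high])
    return r_low, r_high, r_high - r_low


# Toy data: the RBP-splicing relationship flips sign when the modulator is high.
rbp_expr = [1, 2, 3, 4, 1, 2, 3, 4]
psi = [0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1]
mod_expr = [0.1, 0.2, 0.3, 0.4, 5, 6, 7, 8]
r_low, r_high, delta = correlation_change(rbp_expr, psi, mod_expr)
```

A large difference between the two correlations would flag the gene as a candidate modulator in this toy setting. A real analysis would also need a significance test and multiple-testing correction.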
Experimental and theoretical study of the photoelectron spectra of MnOₓ⁻ (x=1–3) clusters
We report a combined experimental and theoretical investigation of MnOₓ⁻ and MnOₓ (x=1–3) clusters. Theoretically, geometrical configurations of various isomers of the clusters were optimized and vertical detachment energies for the anions were evaluated. The ground state of MnO⁻ was predicted to be ⁵Σ⁺, followed by an excited state (⁷Σ⁺) 0.14 eV higher in energy. The ground state of MnO₂⁻ is ⁵B₂, with a ³B₁ isomer 0.15 eV higher. MnO₃⁻ is predicted to be a singlet D₃ₕ cluster. Vibrationally resolved photoelectron spectra of MnOₓ⁻ were measured at several photon energies and under various experimental conditions, and were interpreted based on the theoretical results. The electron affinities of MnO, MnO₂, and MnO₃ were determined to be 1.375 (0.010), 2.06 (0.03), and 3.335 (0.010) eV, respectively. Five excited states of MnO were observed and assigned using the theoretical results. The ⁷Σ⁺ excited state of MnO⁻ was found to be significantly populated and was distinguished from the ground state of the anion by temperature-dependent studies. We observed two isomers for MnO₂⁻, and the detachment features from both isomers were assigned. Only one vibrationally resolved band was observed for MnO₃⁻, which corresponds to transitions from the ground state of MnO₃⁻ to that of MnO₃. The combined experimental and theoretical studies allow us to elucidate the complicated electronic and geometric structures of the various manganese oxide clusters and their anions.
High Glucose Alters Fetal Rat Islet Transcriptome and Induces Progeny Islet Dysfunction
Offspring of diabetic mothers are susceptible to developing type 2 diabetes due to pancreatic islet dysfunction. However, the initiating molecular pathways leading to offspring pancreatic islet dysfunction are unknown. We hypothesized that maternal hyperglycemia alters the offspring pancreatic islet transcriptome and negatively impacts offspring islet function. We employed an infusion model capable of inducing localized hyperglycemia in fetal rats residing in the left uterine horn, thus avoiding other factors involved in programming offspring pancreatic islet health. While euglycemia was maintained in maternal dams and right uterine horn control fetuses, hyperglycemic fetuses in the left uterine horn had higher serum insulin and pancreatic beta cell area. Upon completing infusion from GD20 to 22, RNA sequencing was performed on GD22 islets to identify hyperglycemia-induced changes in gene expression. Ingenuity pathway analysis of the altered transcriptome found that diabetes mellitus and inflammation/cell death pathways were enriched. Interestingly, the downregulated genes modulate more diverse biological processes, which include responses to stimuli and developmental processes. Next, we performed ex vivo and in vivo studies to evaluate islet cell viability and insulin secretory function in weanling and adult offspring. Pancreatic islets of weanlings exposed to late gestation hyperglycemia had decreased cell viability in the basal state and decreased glucose-induced insulin secretion. Lastly, adult offspring exposed to in utero hyperglycemia also exhibited glucose intolerance and insulin secretory dysfunction. Together, our results demonstrate that late gestational hyperglycemia alters the fetal pancreatic islet transcriptome and increases offspring susceptibility to developing pancreatic islet dysfunction.